Random Forest in Machine Learning

Through the Looking Glass

Forest Foresight (Advisor: Dr. Seals)

November 12, 2024

A Random Forest Guided Tour

by Gérard Biau and Erwan Scornet [1]

  • Origin & Success: Introduced by Breiman (2001) [2], Random Forests excel in classification/regression, combining decision trees for strong performance.
  • Versatility: Effective for large-scale problems, adaptable across domains, and able to rank the importance of input variables.
  • Ease of Use: Simple with minimal tuning, handles small samples and high-dimensional data.
  • Theoretical Gaps: Limited theoretical insights; known for complexity and black-box nature.
  • Key Mechanisms: Uses bagging and CART-split criteria for robust performance, though hard to analyze rigorously.
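The bagging mechanism mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: scikit-learn decision trees stand in for the CART trees, and the data is synthetic.

```python
# Bagging sketch: fit each tree on a bootstrap resample of the data
# (the randomization Theta_j), then average the tree predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))                   # D_n: n = 300 points
y = np.sin(4 * X[:, 0]) + 0.2 * rng.normal(size=300)

trees = []
for _ in range(50):                              # M = 50 trees
    idx = rng.integers(0, len(y), len(y))        # resample rows with replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

x = np.array([[0.3, 0.7]])                       # query point
prediction = np.mean([t.predict(x)[0] for t in trees])  # forest average
```

Each tree overfits its own resample; the averaging step is what stabilizes the ensemble.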

Tree Prediction

Each tree estimates the response at point \(x\) as: \[ m_n(x; \Theta_j, D_n) = \frac{\sum_{i \in D_n(\Theta_j)} \mathbf{1}_{X_i \in A_n(x; \Theta_j, D_n)} Y_i}{N_n(x; \Theta_j, D_n)} \] where \(D_n(\Theta_j)\) is the resampled data subset, \(A_n(x; \Theta_j, D_n)\) is the cell containing \(x\), and \(N_n(x; \Theta_j, D_n)\) is the count of points in the cell.

Forest Prediction

The forest estimate for \(M\) trees is: \[ m_{M, n}(x) = \frac{1}{M} \sum_{j=1}^{M} m_n(x; \Theta_j, D_n) \] where \(M\) is the total number of trees, \(m_n(x; \Theta_j, D_n)\) represents the prediction from each tree, and the forest average yields the final prediction.
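The forest estimate \(m_{M,n}(x)\) is a plain average of the \(M\) tree predictions, and this identity can be checked directly. A small sketch using scikit-learn (an assumption; the paper is library-agnostic) on synthetic data:

```python
# Verify that the forest prediction equals the mean of the per-tree
# predictions m_n(x; Theta_j, D_n), as in the formula above.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))            # D_n: n = 200 points in [0, 1]^3
y = X[:, 0] + 0.1 * rng.normal(size=200)  # noisy response Y_i

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

x = np.array([[0.5, 0.5, 0.5]])           # query point x
per_tree = np.array([t.predict(x)[0] for t in forest.estimators_])
assert np.isclose(per_tree.mean(), forest.predict(x)[0])  # m_{M,n}(x) = average
```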

Regression - Tree Construction

Splitting Criteria:

  • The CART-split criterion is used to find the best cut:

\[ L_{\text{reg},n}(j, z) = \frac{1}{N_n(A)} \sum_{i=1}^{n} (Y_i - \bar{Y}_A)^2 \mathbf{1}_{X_i \in A} - \frac{1}{N_n(A)} \left( \sum_{i=1}^{n} (Y_i - \bar{Y}_{A_L})^2 \mathbf{1}_{X_i \in A_L} + \sum_{i=1}^{n} (Y_i - \bar{Y}_{A_R})^2 \mathbf{1}_{X_i \in A_R} \right) \]

  • \(N_n(A)\): Number of data points in cell \(A\).
  • \(Y_i\): Response variable for observation \(i\).
  • \(\bar{Y}_A\): Mean of \(Y_i\) in cell \(A\).
  • \(A_L\) and \(A_R\): Left and right child cells obtained by cutting \(A\) at position \(z\) along coordinate \(j\); \(\bar{Y}_{A_L}\) and \(\bar{Y}_{A_R}\) are the corresponding within-cell means.

Stopping Condition:

Nodes are not split if they contain fewer than nodesize points (a user-chosen minimum leaf size) or if all \(X_i\) in the node are identical.
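The CART-split criterion can be computed directly: the best cut is the one that most reduces the within-cell variance. A self-contained sketch on one-dimensional synthetic data with a jump at 0.6 (the function and data here are illustrative, not from the paper):

```python
import numpy as np

def cart_criterion(x, y, z):
    """CART decrease-in-variance L_reg,n for cutting the cell at threshold z
    along one coordinate: parent sum of squares minus the two child sums of
    squares, normalized by the number of points N_n(A) in the cell."""
    n = len(y)
    left, right = y[x <= z], y[x > z]
    parent = np.sum((y - y.mean()) ** 2)
    children = sum(np.sum((c - c.mean()) ** 2) for c in (left, right) if len(c))
    return (parent - children) / n

rng = np.random.default_rng(1)
x = rng.uniform(size=100)
y = (x > 0.6).astype(float) + 0.05 * rng.normal(size=100)  # jump at 0.6

cuts = np.linspace(0.05, 0.95, 19)
best = max(cuts, key=lambda z: cart_criterion(x, y, z))
print(best)  # the chosen cut lies near the true change point 0.6
```

A real CART implementation scans every coordinate and every midpoint between consecutive data values; the fixed grid here only keeps the sketch short.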


Prediction:

\[ m_{M, n}(x) = \frac{1}{M} \sum_{j=1}^{M} m_n(x; \Theta_j, D_n) \]

  • \(M\): Total number of trees in the forest.
  • \(m_n(x; \Theta_j, D_n)\): Prediction from the \(j\)-th tree.

Classification

Splitting Criteria:

  • The Gini impurity measure is used to determine the best split:

\[ \text{Gini}(A) = 2 p_{0, n}(A) p_{1, n}(A) \]

where:

  • \(p_{0, n}(A)\): Empirical probability of class 0 in cell \(A\).
  • \(p_{1, n}(A)\): Empirical probability of class 1 in cell \(A\).
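The Gini formula above is a one-liner in code; a minimal sketch for a binary cell:

```python
# Gini impurity of a binary cell A: Gini(A) = 2 * p0 * p1,
# where p0 and p1 are the empirical class frequencies in the cell.
def gini(labels):
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)  # empirical probability of class 1
    p0 = 1.0 - p1                   # empirical probability of class 0
    return 2.0 * p0 * p1

print(gini([0, 0, 1, 1]))  # 0.5, maximal impurity for a balanced cell
print(gini([1, 1, 1, 1]))  # 0.0, a pure cell
```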

Prediction:

  • Each tree makes a prediction using the majority class in the cell containing \(x\).
  • Classification uses a majority vote:

\[ m_{M, n}(x; \Theta_1, \ldots, \Theta_M, D_n) = \begin{cases} 1 & \text{if } \frac{1}{M} \sum_{j=1}^{M} m_n(x; \Theta_j, D_n) > \frac{1}{2} \\ 0 & \text{otherwise} \end{cases} \]

where:

  • \(m_n(x; \Theta_j, D_n)\): Prediction from the \(j\)-th tree.
  • \(M\): Total number of trees in the forest.
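The majority-vote rule above, written out directly (since the tree predictions are 0/1, their mean is exactly the fraction of trees voting 1):

```python
# Forest classification by majority vote: predict 1 when the fraction of
# trees voting 1 exceeds 1/2, as in the case expression above.
def forest_vote(tree_predictions):
    return 1 if sum(tree_predictions) / len(tree_predictions) > 0.5 else 0

print(forest_vote([1, 1, 0, 1, 0]))  # 3/5 of trees vote 1 -> predict 1
print(forest_vote([1, 0, 0, 0]))     # 1/4 of trees vote 1 -> predict 0
```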

Like this… (figure from [3])

The Data

                                 Total Orders    Closed Short    Fulfilled
                                 (n=7585)        (n=733)         (n=6852)
Top Customers
  Smoothie Island                1701 (22.43%)   455 (62.07%)    1246 (18.18%)
  Philly Bite                    1556 (20.51%)   267 (36.43%)    1289 (18.81%)
  PlatePioneers                  1396 (18.40%)   143 (19.51%)    1253 (18.29%)
  Berl Company                    906 (11.94%)     5 (0.68%)      901 (13.15%)
  DineLink Intl                   589 (7.77%)     42 (5.73%)      547 (7.98%)
Top Products
  DC-01                          1135 (14.96%)   345 (47.07%)     790 (11.53%)
  TSC-PQB-01                     1087 (14.33%)   389 (53.07%)     698 (10.19%)
  TSC-PW14X16-01                  848 (11.18%)   283 (38.61%)     565 (8.25%)
  CMI-PCK-01                      802 (10.57%)   288 (39.29%)     514 (7.50%)
  PC-05-B1                        745 (9.82%)    220 (30.01%)     525 (7.66%)
Top Distributors
  Ed Don & Company - Miramar      210 (2.77%)      0 (0.00%)      210 (3.06%)
  PFG - Gainesville               197 (2.60%)      0 (0.00%)      197 (2.88%)
  Ed Don & Company - Woodridge    186 (2.45%)      0 (0.00%)      186 (2.71%)
  Ed Don & Company - Mira Loma    180 (2.37%)      0 (0.00%)      180 (2.63%)
  .Ed Don - Miramar               162 (2.14%)      0 (0.00%)      162 (2.36%)

Top Substrates (share of total revenue, $103,826,286)
  Paper      $54,838,585 (52.82%)
  Plastic    $40,336,669 (38.85%)
  Bagasse     $4,350,337 (4.19%)

Key Stats                        Min        Mean         Max
  Quantity Ordered               1          61.47        23,160        (total 1,971,237)
  Unit Price                     $0.16      $62.60       $864.00
  Total Price                    $4.92      $3,430.74    $143,084.74

Analysis - Stutti

Predicting Customer Churn

  • The churn indicator was created based on the Last Sales Date (0/1).
  • Predictors: Class, Product, Qty Ordered, and Date Fulfilled.
  • The model was evaluated using statistics from the Confusion Matrix.
  • 80% Accuracy achieved:
    • Sensitivity: The model correctly identifies 78.6% of the actual 0 cases.
    • Specificity: The model correctly identifies 88.12% of the actual 1 cases.
    • Negative Predictive Value (47.62%): When the model predicts class 1, it is correct only 47.62% of the time; this low value means many predicted-1 orders are actually class 0.
    • McNemar’s Test P-value (<2e-16): Indicates a statistically significant asymmetry in how the two classes are misclassified.
  • Conclusion: Overall, the model balances the two classes reasonably well (balanced accuracy 0.8336, the mean of sensitivity and specificity), though it remains better at predicting class 0.
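The reported rates all derive from the confusion matrix. A hedged sketch of the formulas, using made-up counts chosen only to roughly resemble the reported percentages (these are NOT the project's actual counts):

```python
# Confusion-matrix metrics with class 0 treated as the "positive" class,
# matching the convention used in the analysis. Counts are illustrative.
import numpy as np

#                 predicted 0   predicted 1
cm = np.array([[786, 214],      # actual 0
               [ 24, 178]])     # actual 1

sensitivity = cm[0, 0] / cm[0].sum()       # fraction of actual 0s identified
specificity = cm[1, 1] / cm[1].sum()       # fraction of actual 1s identified
npv = cm[1, 1] / cm[:, 1].sum()            # P(actual 1 | predicted 1)
balanced_acc = (sensitivity + specificity) / 2

print(round(sensitivity, 3), round(specificity, 3),
      round(npv, 3), round(balanced_acc, 4))
```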

Analysis - Matt

Random Forest Model Summary

  • The Random Forest model was trained to predict SalesOrderStatus (Fulfilled vs. Unfulfilled) using 100 trees and mtry = 2.
  • OOB Error Rate: 17.52%, an internal estimate of the error to expect on unseen data.
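A minimal scikit-learn sketch of this setup (an assumption; the original analysis tool is not stated): 100 trees, mtry = 2 expressed as max_features=2, with the OOB error computed from trees that did not see each point. Synthetic stand-in data, since the sales dataset is not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                    # stand-in predictors
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0)   # stand-in Fulfilled / not

clf = RandomForestClassifier(
    n_estimators=100,   # 100 trees, as in the analysis
    max_features=2,     # scikit-learn's analogue of mtry = 2
    oob_score=True,     # score each point only with trees it was left out of
    random_state=0,
).fit(X, y)

oob_error = 1.0 - clf.oob_score_  # out-of-bag error rate
print(round(oob_error, 3))
```

Because each tree is fit on a bootstrap resample, roughly a third of the points are out-of-bag for any given tree, which is what makes this internal error estimate possible without a separate validation set.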

Model Performance Metrics

  • Accuracy: 77%, showing the model correctly classified 77% of test samples.
  • 95% Confidence Interval (CI): The accuracy falls within a 95% CI of (0.759, 0.781), reflecting the precision of the estimate.
  • Kappa: The Kappa statistic, measuring the agreement beyond chance, was 0.055, indicating low agreement beyond random chance.
  • Sensitivity: 0.06198, showing a low ability to identify “Closed Short” cases.
  • Specificity: 0.97642, indicating high accuracy in identifying “Fulfilled” cases.
  • PPV (Closed Short): 43.39%, meaning that of the orders the model labels “Closed Short,” only 43.39% actually are.
  • Balanced Accuracy: 0.5192 (the mean of sensitivity and specificity), barely above the 0.5 expected from random guessing.

Conclusions

  • The model showed high specificity but low sensitivity, performing better at predicting “Fulfilled” cases.
  • UnitPrice and Product were the most significant predictors for classification.
  • The Kappa statistic of 0.055 means the agreement between the predicted and actual SalesOrderStatus (“Fulfilled” vs. “Closed Short”) is barely better than random guessing, consistent with the low balanced accuracy.

Analysis - Mika

Random Forest Model Summary

  • The Random Forest model was trained to predict QuantityFulfilled using 100 records of sales data.
  • The model used 8 predictor variables and 100 bootstrap iterations.
  • Bootstrap validation was performed to assess model stability and performance.

Model Performance Metrics

  • Original MSE: 800.89 units²
  • RMSE: 28.30 units (average prediction error in the original units)
  • MAE: 13.09 units (average absolute prediction error)
  • Bootstrap bias of the MSE: -204.04 (the bootstrap MSE estimates average below the original MSE)
  • Bootstrap standard error of the MSE: 292.65 (variability of the MSE estimate across resamples)
  • 95% Confidence Interval for the MSE: (198.3, 1449.2)
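The bootstrap-validation loop behind these numbers can be sketched as follows. This is a simplified illustration on synthetic data (fewer trees and resamples than the project's 100, purely for speed); the reported figures come from the actual sales data, not from this code.

```python
# Bootstrap validation of a regression error estimate: refit on resampled
# rows, recompute the MSE, then summarize the bootstrap distribution
# (bias, standard error, percentile confidence interval).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(150, 3))
y = 50 * X[:, 0] + 10 * rng.normal(size=150)     # stand-in QuantityFulfilled

def mse_of_fit(X, y):
    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
    return np.mean((model.predict(X) - y) ** 2)

original_mse = mse_of_fit(X, y)
boot = []
for _ in range(30):                              # bootstrap iterations
    idx = rng.integers(0, len(y), len(y))        # resample rows with replacement
    boot.append(mse_of_fit(X[idx], y[idx]))
boot = np.array(boot)

bias = boot.mean() - original_mse                # bootstrap bias of the MSE
se = boot.std(ddof=1)                            # bootstrap standard error
ci = np.percentile(boot, [2.5, 97.5])            # 95% percentile interval
```

Note this sketch scores each fit on its own training rows, which is optimistic; a fuller version would score each bootstrap model on the rows left out of its resample.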

Conclusions

  • The model predicts QuantityFulfilled with a typical error of about 28 units (RMSE).
  • Predictions are typically off by about 13 units in absolute terms (MAE).
  • The negative bootstrap bias (-204.04) shows the resampled MSE estimates run systematically below the original MSE.
  • The wide confidence interval for the MSE (198.3 to 1449.2) indicates high variability in the estimate of prediction accuracy itself.
  • Numerical variables (qtyOrdered, TotalPrice) are substantially more important than categorical ones.

References

[1]
G. Biau and E. Scornet, “A random forest guided tour,” Test (Madr.), vol. 25, no. 2, pp. 197–227, Jun. 2016.
[2]
L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001, doi: 10.1023/A:1010933404324.
[3]
Y. Fu, “Combination of random forests and neural networks in social lending,” Journal of Financial Risk Management, vol. 6, no. 4, pp. 418–426, 2017, doi: 10.4236/jfrm.2017.64030.